Abstract

A methodology, called the force, supports the construction of programs to
be executed in parallel by a force of processes. The number of processes in
the force is unspecified, but potentially very large. The force idea is
embodied in a set of macros which produce multiprocessor Fortran code and
has been studied on two shared memory multiprocessors of fairly different
character. The method has simplified the writing of highly parallel
programs within a limited class of parallel algorithms and is being
extended to cover a broader class. This paper deals with the individual
parallel constructs which comprise the force methodology. Of central
concern are their semantics, implementation on different architectures and
performance implications.
Scientific Research under Grant No. AFOSR 85-1089 while the author was in
residence at ICASE, NASA


Conceptual Basis for the Force

The force [1] methodology for parallel programming arose in trying to
produce high performance parallel programs in a shared-memory
multiprocessor running up to 200 processes on the same user program [2].
Multiprogramming was not an issue, and all emphasis was on single problem
solution speed. Partly for performance measurement purposes and partly for
program manageability, a programming style emerged in which a single piece
of code was written which could be executed by a force of processes in
parallel. The number of processes constituting the force is constant during
execution but is bound as late as the beginning of execution, and may be
one.

The force technique insulates the programmer from all process management
and leaves him the issues involving process synchronization. Since
processes are established by a program independent driver at the beginning
of execution time, parallelism is introduced at the top of the procedure
hierarchy. This has the effect of insulating the user from parallelism
issues with results similar to those obtained by encapsulating parallelism
below a particular level in the procedure hierarchy. The study of
techniques for using the force in a program is essentially a study of
synchronization mechanisms which are independent of the number and
identities of the processes synchronized.

Several advantages arise out of independence from the number of processes.
It is not necessary to design algorithms with a detailed dependence on the,
potentially very large, number of processes executing them. The choice of
the optimal number of processes can be made at run time on the basis of
system hardware configuration and load. Since complete independence from
the number of processes implies correct execution with only one process,
the issues of arithmetic correctness and multi-process synchronization can
be separated in the testing of a program.

Statements written in a force program are implicitly executed by all
processes in parallel. Variables appearing in statements are divided into
local variables, having separate instances for each process, and global
variables, shared among all processes of the force. An assignment
statement, for example, may combine the values of global and local
variables to produce a local or global result. If the result is local, no
assignment conflict is possible. If it is global, then assignment conflict
must be prevented, either by allocation of disjoint sections of a global
data structure to multiple processes or by synchronizing the assignment
across processes, say by enclosing it in a critical section or by using
producer/consumer synchronization on the variable assigned. Library or user
subroutines which are either free of side effects or carefully synchronized
can be invoked in parallel, one copy for each process.

One way in which disjoint sections of a global data structure, specifically
an array, may be allocated to multiple processes is to schedule distinct
index values in a DOALL across processes. Index values may either be
assigned statically to processes once the number of processes is known, in
which case we speak of a prescheduled DOALL, or processes may dynamically
schedule themselves by obtaining distinct values of a global index variable
as they become available to execute the loop body, known as a
self-scheduled DOALL. In either case, the body of the DOALL is executed
once for each index value by some one of the processes. The choice of
prescheduling or self-scheduling may impact performance as a result of
uneven workload division or of conflict on access to the global index
variable.
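
The two scheduling disciplines can be illustrated with a minimal sketch in
Python threads rather than in the force macros themselves. The function
names (run_force, presched_doall, selfsched_doall) are hypothetical, and the
cyclic assignment of indices in the prescheduled version is an assumption;
the text only requires that the assignment be fixed once the number of
processes is known.

```python
import threading

def run_force(nproc, worker):
    """Start nproc threads running worker(me) and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(me,)) for me in range(nproc)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def presched_doall(nproc, lo, hi, body):
    """Prescheduled: process `me` takes lo+me, lo+me+nproc, ... (assumed cyclic)."""
    def worker(me):
        for i in range(lo + me, hi + 1, nproc):
            body(i)
    run_force(nproc, worker)

def selfsched_doall(nproc, lo, hi, body):
    """Self-scheduled: each process claims the next unclaimed index value."""
    next_index = [lo]                  # global index variable
    index_lock = threading.Lock()      # guards next_index
    def worker(me):
        while True:
            with index_lock:           # short critical section on the index
                i = next_index[0]
                next_index[0] = i + 1
            if i > hi:
                return
            body(i)
    run_force(nproc, worker)

N = 10
a = [0] * N
b = [0] * N

def square_into_a(i):
    a[i] = i * i                       # each index writes a disjoint element

def square_into_b(i):
    b[i] = i * i

presched_doall(3, 0, N - 1, square_into_a)
selfsched_doall(3, 0, N - 1, square_into_b)
```

Both calls fill their array identically. They differ only in which process
handles which index and in whether the shared index must be locked.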

The programming language associated with the force consists of some simple
extensions to the Fortran language, which are currently implemented as
macros which are expanded by a language independent preprocessor. The
target Fortran system must, of course, include ways of creating multiple
processes and of supporting synchronized access to global variables. A
currently operational set of macros produces Fortran for the HEP computer
[3], built by Denelcor, Inc., and a set is being constructed for the
Flex/32 [4], built by Flexible Computer Corporation. The macros interact
through the variables of a parallel environment, which contains some
general information such as the number of processes and some machine
dependent items.


Parallelism Constructs of the Force 

The macros currently constituting the force can be divided into several
classes, as shown in Fig. 1. The first class deals with parallel program
structure. The macros Force and Forcesub respectively begin parallel main
programs and parallel subroutines. They make the parallel environment
variables available to the macros within that program module as well as
making the number of processes and a unique identifier for the current
process available to the user at run time. An End Declarations macro marks
the beginning of executable code and provides target locations for
declarations and start up code which may be generated by the macros. A Join
macro terminates the parallel main program. It is the last statement
executed by all processes of the force.

Macros of the second class deal with variable declaration. This class
currently includes only Global and Local macros. Global variables are
associated with Fortran common while local variables are ordinary Fortran
variables local to a separately compiled program module. Sharing of local
variables among several program modules, but local to one process, can only
be accomplished by parameter passing. The static allocation flavor of
Fortran makes it difficult to build a structure of common variables with
one instance for each process when the number of processes is not known
until execution time.

Macros of another class distribute work across processes. The most familiar
construct is the DOALL, which is employed when instances of a loop body for
different index values are independent and can thus be executed in any
order. Two versions are provided. The Presched DO divides index values
among processes in a fixed manner which depends only on the index range and
the number of processes. The Selfsched DO allows processes to schedule
themselves over index values by obtaining the next available value of a
shared index as they become free to do work. For situations in which it is
desirable to parallelize over both indices of a doubly nested loop, both
prescheduled, Pre2DO, and self scheduled, Self2DO, macros are available.

Macros associated with program structure

    Force <name> of <# procs> ident <proc #>
        <declarations>
    End declarations
        <force program>
    Join

    Forcesub <name> of <# procs> ident <proc #>
        <declarations>
    End header
        <subroutine body>
    RETURN

    Forcecall <name>(<parameters>)

Declaration macros

    Global <variable names>
    Local <Fortran declaration>

Macros specifying parallel execution

    Pcase on <variable>
        <code block>
    Usect
        <code block>
    End pcase

    [Pre|Self]sched DO <n> <var> = <i1>, <i2>, <i3>
        <loop body>
    <n> End [pre|self]sched DO

Synchronizing macros

    Barrier
        <code block>
    End barrier

    Critical <variable>
        <code block>
    End critical

    Produce <variable> = <expression>      (producer)
    = Use(<variable>)                      (consumer)


Figure 1. Specific Macros for a Force Program

Independence of the loop body instances over both indices is, of course,
required for correct operation. A similar construct is the parallel case,
Pcase, which distributes different single stream code blocks over the
processes of the force. Execution conditions can be associated with each
block, and any number of these conditions may be true simultaneously. No
order of evaluation of the conditions is specified, and each will be
evaluated by one arbitrarily selected process. Thus conditions depending
only on global variables are most meaningful.

At the heart of the force methodology are the synchronization macros. They
characterize the approach to parallel programming and provide the means for
controlling the force so that coherent and deterministic computation can be
performed. Two subclasses of synchronization are control flow oriented
synchronizations and data oriented synchronizations. The key control
oriented synchronization is the barrier since it provides control of the
entire force. Its semantics are that all processes must execute a Barrier
macro before one arbitrarily chosen process executes the code block between
Barrier and End Barrier. When the code block is complete, the entire force
begins execution at the statement following the End Barrier. Although all
but one process are temporarily suspended by a barrier, no process
termination or creation takes place and all local process states are
preserved across the barrier. Operations which depend on the past
computation, or determine the future progress, of the entire force are
typically enclosed in a barrier.
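
The barrier semantics described above correspond closely to the optional
action of Python's threading.Barrier, which is run by exactly one thread
after all have arrived and before any is released. The following is a sketch
of those semantics only, not of the force implementation; the names NP,
partial, total and seen are made up for the illustration.

```python
import threading

NP = 4                      # number of processes in the force (assumed)
partial = [0] * NP          # global array, one disjoint slot per process
total = [None]              # global result written inside the barrier
seen = [None] * NP          # what each process reads after the barrier

def barrier_block():
    # Code block between Barrier and End Barrier: runs in exactly one
    # thread, after all NP threads have reached the barrier.
    total[0] = sum(partial)

barrier = threading.Barrier(NP, action=barrier_block)

def process(me):
    partial[me] = sum(range(me, 100, NP))   # work on this process's share
    barrier.wait()                          # local state survives the wait
    seen[me] = total[0]                     # whole force resumes here

threads = [threading.Thread(target=process, args=(me,)) for me in range(NP)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The single-stream block depends on the past computation of the whole force
(every partial sum), which is the typical use named in the text.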

Another control based synchronization is the critical section, familiar
from the operating systems literature. Statements between
Critical <variable> and End Critical may only be executed by one process of
the force at a time. This mutual exclusion extends to any other critical
section with the same associated variable. Data oriented synchronization is
provided by the elementary producer-consumer mechanism, in which global
variables have a binary state, full or empty, as well as a value. Execution
by some process of the macro, Produce <variable> = <expression>, waits for
the variable to be in the empty state, sets its value to that of the
expression and makes it full, all in a manner which is atomic with respect
to the progress of any other process. Similarly, the macro,
Use(<variable>), appearing in an expression returns the value of the
variable when it becomes full and sets it empty. Variables in the wrong
state may cause these macros to block the progress of a process. Auxiliary
macros for full/empty variables are Purge <variable>, which sets a variable
empty regardless of its previous state, and Copy(<variable>), which waits
for the variable to be full and returns its value but does not empty it.
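
A full/empty variable with the four operations just described can be
sketched in software with a condition variable. This is an illustration of
the semantics under the assumption of a Python threading environment; the
class name FullEmpty and its method names are hypothetical and merely
mirror the macro names.

```python
import threading

class FullEmpty:
    """A shared variable carrying a value and a full/empty state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def produce(self, value):
        """Wait for empty, write the value, set full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def use(self):
        """Wait for full, read the value, set empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def copy(self):
        """Wait for full and read the value, leaving the variable full."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            return self._value

    def purge(self):
        """Set empty regardless of the previous state."""
        with self._cond:
            self._full = False
            self._cond.notify_all()

cell = FullEmpty()
received = []

def consumer():
    for _ in range(5):
        received.append(cell.use())

t = threading.Thread(target=consumer)
t.start()
for k in range(5):
    cell.produce(k)     # blocks until the consumer has emptied the cell
t.join()
```

With one producer and one consumer the values arrive in order, because each
produce must wait for the preceding use.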

A major weakness in the current set of force macros is that it does not
smoothly support decomposition of a program into parallel components on the
basis of functionality. The Pcase macro offers the rudiments of this, but
only allows one process to execute each of the parallel functions. What is
desired is a macro, Resolve, which will resolve the force into components
executing different parallel code sections. The section of code for each
component would start with Component <name> strength <number>, which would
name the component and specify the fraction of the force to be devoted to
this component. The component strengths would be estimated by the
programmer on the basis of any knowledge available about the computational
complexity of each component. A macro, Unify, would reunite the components
into a single force. The implementation of Resolve is complicated by the
conflicting demands of generality and efficiency. If the number of
components is larger than the number of processes in the force, then
inter-component synchronization may deadlock unless the components are
co-scheduled over the available processes. An implementation which produces
process rescheduling at every possible deadlock point and is still
efficient when the number of processes exceeds the number of components is
under development.
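
Since Resolve is described only as a proposal, the following is a purely
hypothetical sketch of one way component strengths might be turned into
process counts when the force has at least as many processes as
components. The function name split_force, the use of integer weights for
strengths, and the largest-remainder rounding rule are all assumptions made
for illustration. The harder case of more components than processes, which
requires co-scheduling, is not attempted.

```python
def split_force(nproc, strengths):
    """Divide nproc processes among components in proportion to strength.

    strengths maps a component name to a positive integer weight.
    Every component receives at least one process.
    """
    names = list(strengths)
    if nproc < len(names):
        raise ValueError("fewer processes than components: co-scheduling needed")
    total = sum(strengths.values())
    spare = nproc - len(names)          # processes left after one per component
    # quotient and remainder of each component's proportional share of spare
    shares = {n: divmod(spare * strengths[n], total) for n in names}
    counts = {n: 1 + shares[n][0] for n in names}
    leftover = nproc - sum(counts.values())
    # give the remaining processes to the largest remainders
    by_remainder = sorted(names, key=lambda n: shares[n][1], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts

counts = split_force(10, {"assemble": 5, "solve": 3, "output": 2})
```

Because the counts depend only on the strengths and the size of the force,
a program written this way stays independent of the number of processes.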

Incorporation of a Resolve macro would make it useful to extend the barrier
idea. A barrier should be able to specify whether only the processes in the
current component are to be blocked or whether all processes in the parent
force are to participate. In the case of recursively nested Resolve
constructs, the barrier might specify a nesting level relative to the one
in which it appears.

The Resolve idea promises a mechanism for functional decomposition of
programs into parallel components, but there is one more capability of
parallel programming environments with explicit process management which is
not addressed by the force. This is the ability to give away work to
"available" processes in a dynamic manner during execution. This ability is
most called for by tree algorithms and dynamic divide-and-conquer methods.
It would be desirable for the force to contain a mechanism for efficiently
handling such algorithms without making the user responsible for explicit
process management or losing the benefits of independence of the number of
processes. A mechanism related to Resolve might be applied at each tree
node but could lead to much process management overhead in cases where the
correct thing to do is merely to traverse a subtree with the one remaining
process.


Interrelationships Between the Primitives 

The semantics of the parallelism constructs in the force imply certain
restrictions on the way they are used together in a program. Several of the
constructs restrict execution to a single stream within some code block.
Barrier and Pcase limit execution of enclosed blocks to a single process
while critical section code is eventually executed by all processes, but
only one at a time. Thus constructs which depend on multiple, simultaneous
execution, such as DOALL, Pcase or Barrier, should not appear within such
blocks. A critical section within a Barrier is meaningless, but critical
sections have definite use within two or more code blocks of a Pcase
construct. Nested critical sections have meaning when the associated
locking variables are different. Data oriented synchronization primitives
may occur within singly executed code without restriction, other than the
natural possibility of deadlock. In fact, initialization of full/empty
variables is usually done within a singly executed block.

Parallel loops do not restrict the execution of their bodies to a single
process, but they do limit execution of the body for each index value to
one process. Thus constructs which depend on full parallel execution cannot
appear within DOALLs. These include Barrier, Pcase and other DOALLs. The
inconsistency in the parallelism requirements of nested DOALLs is the
reason for supplying multiple index DOALLs for parallel execution of loop
bodies which are independent over the Cartesian product of two or more
index sets. Critical sections, Produce and Use are quite useful within
DOALLs and often lead to programs in which the distributed nature of
synchronization reduces its effect on program performance.

Subroutine invocation within a force program can be done either with a
Forcecall or an ordinary Fortran CALL. Only the Forcecall makes the
parallel environment available to the subroutine called. Since a force
subroutine invoked by Forcecall assumes that all processes of the force
will enter it, a Forcecall must not appear within a code body in which
parallel execution has been restricted. Thus, Forcecalls are not meaningful
within Barrier, Pcase, Critical or DOALL constructs. An ordinary CALL
implies execution of a subroutine in single stream on behalf of one or more
processes. Since any Fortran based parallel system must support multiple
independent execution of subroutines, such as those in the mathematical
library, subroutines must have separate local variable states for all
processes executing them. An ordinary Fortran subroutine or function call
may thus appear within any code section of a force program. The subroutines
or functions so invoked contain no parallel constructs and access by them
to any shared variables must be controlled externally if it is desired.

The Resolve construct is intended to produce a new parallel execution
environment within each of its components, differing from the original only
in the number of processes. Thus all of the parallelism primitives have
meaning within a force component. The implementations of the primitives
must, of course, refer to the parallel environment of the component rather
than of the original force. The meaning of Barrier, as has been noted, can
be extended to refer to higher levels of a nested component structure, but
it retains its original meaning with respect to the immediate component
with no modification of its semantics. Barrier, Pcase and the DOALLs have
an action limited to the component in which they appear. Critical sections
and data oriented synchronizations can synchronize operations within the
current component with operations in any other components which share the
corresponding variables.


Performance Issues 

Various features of the force methodology are related to the performance of
a parallel computer system. An overall principle used in selection of
primitive operations for inclusion in the force was that the semantics of
each primitive should be simple enough to admit of an efficient
implementation across the range of shared memory multiprocessors. The
simple process model, consisting of program counter, local variables and
unique identifying index, also contributes to low overhead implementation
on most shared memory machines. Process priorities and parent-child
relationships, for example, can significantly complicate the implementation
of a parallel programming system on some multiprocessors which do not
directly support such features.

The primitive operations of the force define a virtual machine, and the
generality of this machine yields independence from the details of the
underlying hardware. This benefit of machine independence and portability
need not, however, suppress all machine performance issues at the level of
force programming. Pratt [5] points out that a virtual machine for parallel
execution should make "visible," as programming alternatives, distinctions
which may reflect major hardware performance differences. The clearest
example of such alternatives within the force is the existence of both a
prescheduled and a self-scheduled DOALL.

At the level of the abstract machine, the process interactions implied by
pre- and self-scheduling are different. Prescheduling, since it allocates
index values to processes in a fixed way as soon as the number of processes
is determined, will split the workload evenly across processes only if
processors run at similar speeds and the amount of computation specified by
the DOALL body is independent of index value. On the other hand, no process
interaction is required to allocate the index values; each process can
determine its own portion of the work independently. In contrast, the
self-scheduling technique allows processes to load balance at execution
time by obtaining further index values whenever they complete the work
connected with previous values. This is done at the expense of a short
critical section to obtain, increment and store a shared index variable.

For a given underlying hardware, these distinctions at the abstract machine
level can be translated into performance differences by using a few general
characteristics of the hardware system. The most important parameters for
the pre- versus self-scheduling comparison are the size, in execution time,
of a minimal critical section to access and update a shared index and the
number of processes competing for this access. When combined with the
program dependent parameters of the mean and standard deviation of the
DOALL body size over the set of index values, they allow a determination of
which type of scheduling will lead to better performance.
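
The trade-off can be made concrete with a toy timing model. This is not the
analysis used in the paper; it is a deterministic simulation under stated
assumptions: prescheduling assigns indices cyclically and costs nothing to
schedule, while self-scheduling serializes every fetch of the shared index
through a critical section of fixed length c, including the final fetch
that finds no work left. The function names are hypothetical.

```python
import heapq

def presched_time(body_times, nproc):
    """Completion time when process p executes indices p, p+nproc, ..."""
    return max(sum(body_times[p::nproc]) for p in range(nproc))

def selfsched_time(body_times, nproc, c):
    """Completion time when each index fetch holds a lock for time c."""
    ready = [(0, p) for p in range(nproc)]   # (time process asks for work, id)
    heapq.heapify(ready)
    lock_free = 0        # time at which the index lock is next available
    nxt = 0              # next unclaimed index value
    finish = 0
    while ready:
        t, p = heapq.heappop(ready)
        start = max(t, lock_free)            # may wait for the lock
        lock_free = start + c                # critical section on the index
        if nxt < len(body_times):
            heapq.heappush(ready, (lock_free + body_times[nxt], p))
            nxt += 1
        else:
            finish = max(finish, lock_free)  # no work left, process is done
    return finish

uniform = [10] * 8                           # zero variance in body size
skewed = [40, 1, 1, 1, 40, 1, 1, 1]          # same count, large variance

uniform_pre = presched_time(uniform, 4)
uniform_self = selfsched_time(uniform, 4, 1)
skewed_pre = presched_time(skewed, 4)
skewed_self = selfsched_time(skewed, 4, 1)
```

In this model the uniform workload favors prescheduling, since the index
lock is pure overhead, while the skewed workload favors self-scheduling
because both long iterations would otherwise land on the same process.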


Implementation Issues 

Implementation issues can be addressed on the basis of variations in the
two current implementations. Several hardware differences between the HEP
and the Flex/32 multiprocessors influence implementation of the force
macros. A minor, but basic level, difference is that all memory in the HEP
can be shared by all processes so only Fortran variable scope issues are
involved in implementing global variables. In the Flex/32, only a
restricted portion of the address space is accessible by processes running
on different processors so shared variables must physically reside in these
addresses as well as satisfying Fortran conventions for name sharing by
different modules. The shared address space on the Flex/32 is large enough
and its access time near enough to that of local memory that this should
not be an issue except for programs with very large global data
requirements.

The basic synchronization mechanism in HEP is the locking full/empty bit in
each memory cell. Locks in the Flex/32 are separate and, although there are
8192 of them, they form a scarcer resource than HEP synchronization
elements. Furthermore, since the HEP has hardware to support the temporary
suspension of processes, the user can do synchronizations directly while
the manipulation of locks in the Flex/32 must be done through the operating
system. Figure 2 shows critical sections for both machines and notes the
user instruction versus system call distinction. The Flex/32 ConCurrent C
system supports the association of a lock with any shared variable to which
synchronized access is made, so at this level the machine differences are
not major, as far as implementation of the force macros is concerned. As
shown in Fig. 2, the critical section macro has an associated variable to
allow for distinct sets of interacting critical sections. In both
implementations this becomes a global variable which is locked (directly in
the HEP and via system call in the Flex/32) on entry to and unlocked on
exit from the critical section.

The Produce and Use macros are quite different on the two systems simply
because they correspond directly to single memory access instructions on
the HEP and must involve the locks on the Flex/32. Implementation of
Produce and Use for fewer than 8192 variables might be done using the
hardware locks on the Flex/32, but full/empty access to individual elements
of a large array requires a software supported association of variables
with full/empty bits. Synchronized access to the bits and their associated
variables needs to employ a combination of the locks and events supported
by the hardware.

HEP

    Critical lock1             call awrite(lock1, .true.)
        <code block>               <code block>
    End Critical               call laread(lock1)

Flex/32

    Critical lock1             call CClock(1, "lock1")
        <code block>               <code block>
    End Critical               call CCunlck(1, "lock1")

    Single instruction HEP               Flex/32 operating
    Fortran intrinsics                   system calls

    awrite - wait for empty,             CClock  - wait for unlocked
             write, set full                       and lock
    laread - wait for full, read,        CCunlck - unlock
             set empty (logical)


Figure 2. Implementation of Critical Sections

Implementation of the Barrier macro shows some clear differences between
the two systems. In both cases, the semantics requiring all processes to
arrive before the code section is executed is supported by a shared counter
synchronized as in the critical section. Two barrier mechanisms have been
used on the HEP. In systems small enough that memory contention is not a
problem, the last process to increment the shared counter executes the code
section and fills a memory location which the other processes are
attempting to read. Processes must then count down the counter as they exit
the barrier, with the last one resetting the lock. If memory contention is
a problem, the ability to control processes at the user level allows
writing of a HEP assembly language routine in which all but the last
process to enter a barrier terminate execution to be recreated with their
previous state by the last process to enter the barrier. In the Flex/32,
process control is a system function. The system, however, supports the
concept of shared events, connected to processes in a broadcast
configuration. Here, processes entering the barrier wait on the event,
except for the last one, which executes the code block of the barrier and
then activates the event. Verifying that each process connected to the
event has seen it is part of the operating system support, so no exit code
is required. The first mechanism for the HEP is contrasted with the Flex/32
implementation in Fig. 3.
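
The counter-plus-event scheme described for the Flex/32 can be imitated
with a lock and a broadcast event in Python. The sketch below is a
single-use barrier and only an analogy: threading.Event stands in for the
shared event, and the class name EventBarrier is hypothetical.

```python
import threading

class EventBarrier:
    """Single-use barrier: a locked arrival counter plus a broadcast event."""

    def __init__(self, nproc):
        self.nproc = nproc
        self.nbar = 0                      # shared arrival counter
        self.lock = threading.Lock()       # guards nbar
        self.event = threading.Event()     # broadcast to all waiters

    def wait(self, block):
        with self.lock:                    # critical section on the counter
            self.nbar += 1
            nloc = self.nbar
        if nloc == self.nproc:
            block()                        # last arrival runs the code block
            self.event.set()               # then activates the event
        else:
            self.event.wait()              # earlier arrivals wait for it

NP = 4
log = []                                   # list.append is atomic in CPython
bar = EventBarrier(NP)

def process(me):
    log.append("before")
    bar.wait(lambda: log.append("block"))
    log.append("after")

threads = [threading.Thread(target=process, args=(me,)) for me in range(NP)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No exit protocol is needed here because the event stays set. A reusable
barrier would have to reset the counter and the event safely, which is the
job of the count-down phase in the first HEP mechanism.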


Conclusions 

The design of a parallel programming system involves a combination of the
issues of utility with those of implementation efficiency. The utility
issues have been treated in previous papers [1] [6] while this work
concentrates on the individual macro semantics and implementation issues.
The force methodology supports efficient implementation by the simplicity
of its process model and lack of complex semantics in individual parallel
constructs. At least two multiprocessors with shared memory admit of
straightforward implementations.

HEP

    Barrier                    if (lwaitf(ilock)) continue
                               nloc = laread(nbar) + 1
                               call awrite(nbar, nloc)
                               if (nloc .eq. np) then
        <code block>               <code block>
                                   call sete(ilock)
                                   call awrite(olock, .true.)
                               endif
    End Barrier                if (lwaitf(olock)) continue
                               nloc = laread(nbar) - 1
                               call iawrite(nbar, nloc)
                               if (nloc .eq. 0) then
                                   call sete(olock)
                                   call awrite(ilock, .true.)
                               endif

Flex/32

    Barrier                    call CClock(1, "nbar")
                               nloc = nbar + 1
                               nbar = nloc
                               call CCunlck(1, "nbar")
                               if (nloc .eq. np) then
        <code block>               <code block>
                                   call CCactev(1, 4, "bar")
                               else
                                   call CCwev(1, 4, "bar")
    End Barrier                endif

    HEP                                    Flex/32
    single instruction intrinsics          operating system calls

    waitf  - wait for full, read           CClock  - wait free, set lock
    aread  - wait full, read, set empty    CCunlck - clear lock
    awrite - wait empty, write, set full   CCactev - activate event
    sete   - set empty                     CCwev   - wait for event


Figure 3. Implementation of Barriers